Abstract

We present a theory of boosting probabilistic classifiers. We place ourselves
in the situation of a user who only provides a stopping parameter and a
probabilistic weak learner/classifier, and compare three types of boosting algorithms:
probabilistic Adaboost, decision tree, and tree of trees of ... of trees, which we
call matryoshka. "Nested tree," "embedded tree" and "recursive tree" are also
appropriate names for this algorithm, which is one of our contributions. Our other
contribution is the theoretical analysis of the algorithms, in which we give training
error bounds. This analysis suggests that the matryoshka leverages probabilistic
weak classifiers more efficiently than simple decision trees.

1 Introduction 

Ensembles of classifiers are a popular way to build a strong classifier by leveraging
simple decision rules, the weak classifiers. Many ensemble architectures have been
proposed, such as neural networks, decision trees, Adaboost [1], bagged classifiers [2],
random forests [3], trees holding a boosted classifier at each node [4], boosted decision
trees... One drawback of ensemble methods is that they are often wasteful with respect
to the computational cost of the resulting classifier. For example, Adaboost [1],
bagging [2] and random forests [3] all multiply the run-time complexity by a factor
approximately proportional to the training time. This is not acceptable in applications
involving large amounts of data and requiring low-complexity methods, such as video
analysis and data mining.

Many approaches have been proposed to deal with such situations. The cascade
architecture, i.e. a degenerate decision tree, has become very popular [5] and has been
intensely studied [6, 7]. However, cascades are mostly appropriate to detect rare
exemplars of interest amongst a huge majority of uninteresting ones.

Decision trees, on the other hand, are better adapted to the case of balanced target
classes. This advantage comes from their greater facility to decompose the input space
into more manageable and useful subsets. In addition, their run-time complexity is
approximately proportional to the logarithm of the training time. Counterbalancing
these advantages is the fact that decision trees tend to overfit the training data.

There exist many proposed methods to reduce overfitting, for example pruning
and smoothing, but the main recognized cause remains: data elements are passed to
only one of the descendants of each node, whether during training or at run-time. At
run-time, one proposed solution is to pass examples along more than one child node [8].
During training, it has been proposed [4] to pass down to all descendants the exemplars
that lie within a fixed distance of the separating surface.

Figure 1: Left: Construction of a matryoshka decision tree. Here, each tree has just
two nodes, but other numbers are possible. Right: notation used to specify a path in a
decision tree.

Our approach to avoid the hard split at each node is to consider that the examples
have a certain probability (not necessarily always 0 or 1) of being passed to any
descendant of the tree. That is, we study probabilistic decision trees [9], but pursue a
different analysis from these last works. First, we show that probabilistic decision trees
are eminently tractable within the framework of boosting (see footnote 1).

We bound the expected misclassification error as a function of the number of nodes
in the tree, in Section 4. This bound is very high when the probabilistic weak classifiers
are very weak. Moreover, we present arguments that suggest that any bound using the
same probabilistic weak learner hypothesis will necessarily be high.

However, we also note that the bound achieved with stronger weak classifiers is
much better. In an attempt to strengthen our weak classifiers, we explore the possibility
of assembling decision trees consisting of decision trees, the inner and the outer trees
being built by the same algorithm. This is not the first time this idea has been suggested,
but we believe we are the first to show the theoretical benefits of doing so.

Continuing on the idea of embedding (or nesting) decision trees one into another,
we propose to assemble trees of trees of ... of trees of probabilistic weak classifiers.
This is similar to matryoshka dolls, with the difference that each tree contains more
than one tree, rather than a single other doll. Figure 1, left, illustrates this concept. Our
main contribution (Section 5.2) is to prove a greatly improved bound, reached by trees
with exactly two nodes, each node being a tree with two nodes, and so on until the last
nesting level, which holds two probabilistic weak classifiers.

Another merit of our study is that it proposes a methodology that is essentially
parameterless. The user only needs to provide a probabilistic weak learner and a stopping
criterion, such as the number of nodes or the error on the training dataset. If a
stopwatch (see footnote 2) is available during training, then we propose ways of using it.
The absence of parameters results partly from applying a principle of greedy error
minimization.

Before presenting our study on decision trees, we define, in Section 2, the
probabilistic weak learners that are the basis of this work. We then present, in Section 3,
the probabilistic equivalent of Adaboost that will serve as reference for the rest of the
article. After presenting our main theory in Sections 4 and 5, we discuss further the
findings of this study and open directions for future research.

1 The analogy between deterministic decision trees and boosting has been studied in [10].
2 This metaphor is to say that, if the time complexity of the learner and classifier are known or measurable,
then this information can be used to greedily reduce the training error.

2 Probabilistic weak learner 

We consider learning algorithms $A$ that, given a training dataset
$S = \{(X_1, y_1, D(1)), (X_2, y_2, D(2)), \ldots, (X_N, y_N, D(N))\}$, with
$D(1) + \ldots + D(N) = 1$, return a probabilistic classifier, or oracle, written $h(X)$.
For any input $X$, $h(X)$ is identified with a Bernoulli random variable with parameter
$q(+, X)$, the probability that $h(X) = +1$.

Definition: We say that $A$ is a probabilistic weak learner if there exists a constant
$0 < \epsilon < \frac{1}{2}$ such that, for any dataset $S$, the expected error of $h(X)$ is
smaller than $\frac{1}{2} - \epsilon$; that is, one has:

$$\sum_{n=1}^{N} D(n)\, q(-y_n, X_n) < \frac{1}{2} - \epsilon,$$

where $q(-y_n, X_n)$ is the probability that $h(X_n)$ takes the value $-y_n$, i.e. that
the classifier is wrong.

The constant $\epsilon$, called the advantage or edge, is unknown and does not need to be
known. The probability $q(y_n, X_n)$, also unknown, will be needed. We estimate
it by repeatedly calling the weak classifier and calculating the maximum likelihood
(ML) or maximum a-posteriori (MAP) estimates: if $h_{1,n}, \ldots, h_{R,n}$ are the values
returned by $R$ invocations (observations) of $h(X_n)$, then the ML estimate is
$\hat{q}(y, X_n) = |\{r \mid h_{r,n} = y\}| / R$, where $|\cdot|$ is the set cardinal. Assuming that
$q(y, X_n)$ is uniformly distributed in $[0, 1]$ (see footnote 3), the MAP is
$\hat{q}(y, X_n) = (1 + |\{r \mid h_{r,n} = y\}|) / (R + 2)$.
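The two estimators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the recorded outputs are hypothetical, classifier outputs are taken in {-1, +1}, and the second function uses the smoothed count (1 + count) / (R + 2), which is biased towards 1/2 as the text requires of its MAP estimate.

```python
from typing import Sequence


def ml_estimate(outputs: Sequence[int], y: int) -> float:
    """ML estimate: fraction of the R observed outputs that equal y."""
    return sum(1 for h in outputs if h == y) / len(outputs)


def smoothed_estimate(outputs: Sequence[int], y: int) -> float:
    """Smoothed estimate (1 + count) / (R + 2), biased towards 1/2."""
    return (1 + sum(1 for h in outputs if h == y)) / (len(outputs) + 2)


# Hypothetical record of R = 4 calls to h(X_n).
observed = [+1, +1, -1, +1]
print(ml_estimate(observed, +1))        # 3 / 4
print(smoothed_estimate(observed, +1))  # 4 / 6
```

With few samples the two estimates differ noticeably; they approach each other as R grows.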

3 Adaboost for probabilistic weak learners 

We now adapt Adaboost [1] to probabilistic, rather than deterministic, weak learners.
Like the original Adaboost, we consider classifiers of the form

$$H(X) = \sum_{t=1}^{T} \alpha_{t,X}\, h_t(X),$$

but here, $h_t(X)$ is a Bernoulli random variable (or randomized classifier, or oracle),
so that $H(X)$ is itself a random variable. As in Adaboost, we consider
domain-partitioned weights (see [1, Sec. 4.1]): we have constants $\alpha_{t,+}$ and
$\alpha_{t,-}$ such that $\alpha_{t,X} = \alpha_{t,+}$ if $h_t(X)$ is observed to be $+1$, and
$\alpha_{t,X} = \alpha_{t,-}$ otherwise.

We proceed as in Adaboost, increasing the number $T$ of weak classifiers, and not
changing a weak classifier once it has been trained. Each random classifier $h_t(X)$ is
obtained by running the weak learner on the training dataset $(X_1, y_1), \ldots, (X_N, y_N)$,
with weights $D_t(n)$, $1 \le n \le N$, chosen to emphasize misclassified examples. The
weight update rule is:

$$D_{t+1}(n) = D_t(n) \left( q(t, +, X_n)\, e^{-\alpha_{t,+} y_n} + q(t, -, X_n)\, e^{\alpha_{t,-} y_n} \right) / Z_t, \tag{1}$$

where $Z_t = \sum_{n=1}^{N} D_t(n) \left( q(t, +, X_n)\, e^{-\alpha_{t,+} y_n} + q(t, -, X_n)\, e^{\alpha_{t,-} y_n} \right)$
normalizes the weights so that they sum to one.

3 This prior is pessimistic, since $q(-y_n, X_n)$ has (unknown) expectation smaller than $\frac{1}{2}$, but the edge $\epsilon$
being unknown, using another prior would not be less hazardous.

With a deterministic weak classifier, one would have $q(\pm, X_n) \in \{0, 1\}$,
resulting in the original Adaboost weight update rule. An additional difference is that the
$q(\pm, X_n)$ are unknown. We address this issue in Sec. 3.2 and assume for now that
we have estimates $\hat{q}(\pm, X_n)$.
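A minimal sketch of one application of the weight update rule, Eq. (1), may help fix ideas. The function name `update_weights` and the numerical values are illustrative assumptions; `q_plus[n]` stands for an estimate of the probability that the classifier outputs +1 on example n, so that the probability of -1 is its complement.

```python
import math
from typing import List, Sequence, Tuple


def update_weights(
    weights: Sequence[float],
    labels: Sequence[int],
    q_plus: Sequence[float],
    alpha_plus: float,
    alpha_minus: float,
) -> Tuple[List[float], float]:
    """Apply Eq. (1) once and return (new weights, normalizer Z_t)."""
    unnormalized = [
        d * (q * math.exp(-alpha_plus * y) + (1.0 - q) * math.exp(alpha_minus * y))
        for d, y, q in zip(weights, labels, q_plus)
    ]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z


# Hypothetical data: four examples, uniform initial weights.
new_weights, z = update_weights(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[+1, +1, -1, -1],
    q_plus=[0.9, 0.6, 0.2, 0.4],
    alpha_plus=0.5,
    alpha_minus=0.5,
)
print(new_weights, z)
```

Setting every `q_plus[n]` to 0 or 1 gives back the deterministic Adaboost update with domain-partitioned weights.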



3.1 Boosting property 

We now give an upper bound for the expected misclassification error of $H(X)$, and
show how to choose the weights $\alpha_{t,\pm}$. This derivation parallels that of [1]: the expected
training error is

$$E(\mathrm{Loss}) = \sum_{n=1}^{N} D(n)\, E\left([\operatorname{sign} H(X_n) \ne y_n]\right) \le \sum_{n=1}^{N} D(n)\, E\left(e^{-H(X_n) y_n}\right), \tag{2}$$



where [.] is the "indicator function," being 1 if the bracketed expression is true and 
zero otherwise. 

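Since $H(X_n)$ may take at most $2^T$ possible values, depending on the outputs $s_t$,
$1 \le t \le T$, of the $T$ classifiers $h_t(X_n)$, one has:

$$\begin{aligned}
E\left(e^{-H(X_n) y_n}\right)
&= \sum_{s_1, \ldots, s_T} \prod_{t=1}^{T} q(t, s_t, X_n)\, e^{-\alpha_{t,s_t} s_t y_n} \\
&= \sum_{s_1, \ldots, s_{T-1}} \prod_{t=1}^{T-1} q(t, s_t, X_n)\, e^{-\alpha_{t,s_t} s_t y_n} \; Z_T \frac{D_{T+1}(n)}{D_T(n)} \\
&= \sum_{s_1, \ldots, s_{T-2}} \prod_{t=1}^{T-2} q(t, s_t, X_n)\, e^{-\alpha_{t,s_t} s_t y_n} \; Z_{T-1} \frac{D_T(n)}{D_{T-1}(n)}\, Z_T \frac{D_{T+1}(n)}{D_T(n)} \\
&= \ldots \text{ etc } \ldots \\
&= \frac{D_{T+1}(n)}{D_1(n)} \prod_{t=1}^{T} Z_t .
\end{aligned}$$

Summing over all samples (with $D_1 = D$), one gets the familiar expression

$$E(\mathrm{Loss}) \le \prod_{t=1}^{T} Z_t .$$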

The rest goes as with Adaboost: each $Z_t$ is minimized by setting

$$\alpha_{t,+} = \frac{1}{2} \log\left(\frac{W_t^{++}}{W_t^{+-}}\right) \quad \text{and} \quad \alpha_{t,-} = \frac{1}{2} \log\left(\frac{W_t^{--}}{W_t^{-+}}\right),$$

where $W_t^{ab} = \sum_{n \mid y_n = b} D_t(n)\, q(t, a, X_n)$, for any $a \in \{+1, -1\}$, $b \in \{+1, -1\}$. For
this choice of $\alpha_{t,\pm}$, one has $Z_t = 2\sqrt{W_t^{++} W_t^{+-}} + 2\sqrt{W_t^{-+} W_t^{--}}$, and one can show
that $Z_t \le \sqrt{1 - 4\epsilon^2} = \rho$. The expected error of the $T$-stage boosted probabilistic
classifier thus has the same bound as the error of Adaboost:

$$E(\mathrm{Loss}) \le \left(\sqrt{1 - 4\epsilon^2}\right)^T = \rho^T. \tag{3}$$

It must be made clear that, in practice, during training, the users only have estimates
of $q(t, \pm, X_n)$, so that they reduce an estimate of the bound of the expected error.

Slightly expanded version of my ECML 2006 submission
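As an illustration of this step, the following sketch computes the four sums $W_t^{ab}$, the two weights and the resulting $Z_t$ from estimated probabilities. The function name and the data are hypothetical, and the code assumes all four sums are strictly positive, so that the logarithms are defined.

```python
import math
from typing import Sequence, Tuple


def optimal_alphas(
    weights: Sequence[float], labels: Sequence[int], q_plus: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (alpha_plus, alpha_minus, Z_t) for one boosting stage."""
    # w[(a, b)]: weight of examples with label b on which the output is a.
    w = {(a, b): 0.0 for a in (+1, -1) for b in (+1, -1)}
    for d, y, q in zip(weights, labels, q_plus):
        w[(+1, y)] += d * q
        w[(-1, y)] += d * (1.0 - q)
    alpha_plus = 0.5 * math.log(w[(+1, +1)] / w[(+1, -1)])
    alpha_minus = 0.5 * math.log(w[(-1, -1)] / w[(-1, +1)])
    z = (2.0 * math.sqrt(w[(+1, +1)] * w[(+1, -1)])
         + 2.0 * math.sqrt(w[(-1, +1)] * w[(-1, -1)]))
    return alpha_plus, alpha_minus, z


# Hypothetical classifier that is right with probability 0.7 (edge 0.2)
# on every example of a balanced, uniformly weighted dataset.
a_plus, a_minus, z = optimal_alphas(
    [0.25] * 4, [+1, +1, -1, -1], [0.7, 0.7, 0.3, 0.3]
)
print(a_plus, a_minus, z, math.sqrt(1 - 4 * 0.2 ** 2))
```

In this symmetric example the value of $Z_t$ coincides with $\sqrt{1 - 4\epsilon^2}$, the quantity that appears in Eq. (3).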

3.2 Estimation of $q(\pm, X_n)$ during training

In this section, we show how users can, in practice, balance their need for accurate
estimates of the $q(t, \pm, X_n)$ with their eagerness to reduce the estimate of the bound of
the expected error (the reader may skip this part on a first reading).

The difference with respect to Adaboost is that, once the classifier $h_T$ has been trained,
the user has to estimate the $q(T, \pm, X_n)$ in order to compute $D_{T+1}(n)$ for the next
classifier. The question is thus "how many samples of $h_T(X_n)$ should be taken?" The
trivial answer is to fix some number $R$ of samples and use the corresponding MAP or ML
estimates. We exclude this approach for now, to avoid adding an extra parameter $R$.
We propose, instead, two approaches based on the MAP estimator of $q(t, \pm, X_n)$.

Let us first compare the MAP and ML estimators, to better explain our preference
for the MAP later. Both the MAP and ML converge in probability to the true value, so
that the corresponding estimators of $Z_t$ also converge in probability to the true value.
The MAP and ML differ in that the MAP estimator of $q(T, \pm, X_n)$ is biased towards
$\frac{1}{2}$, and that of $Z_T$ is biased towards 1. More precisely, the expected values of these MAP
estimators converge to their limits from above, so that the expected value of successive
estimates of $Z_T$ decreases towards the true value. Thus, after sampling $h_T(X_n)$
$R$ times, sampling once more is always expected to decrease the estimate of $Z_T$.

First approach to estimate $q(T, \pm, X_n)$: The MAP thus has the advantage of providing
a natural stopping time, that of the first observed increase in our estimate of $Z_T$.
The event that the estimate of $Z_T$ increases has a probability that increases towards 1/2,
so that it will almost always (in the probabilistic sense) happen after a finite time. This
strategy can also be used with the ML estimator but, since the ML estimator has a greater
variance, it is more likely to result in a spuriously low estimate of $Z_T$ and in early
stops. On these grounds, the MAP should be preferred over the ML.
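The stopping rule of the first approach could be sketched as follows. Everything here is an illustrative assumption rather than the author's implementation: the function names, the callable `sample(n)` returning one random output in {-1, +1} for example n, the smoothed estimate (1 + count) / (rounds + 2), and the safety cap `max_rounds`.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def estimate_z(weights: Sequence[float], labels: Sequence[int],
               plus_counts: Sequence[int], rounds: int) -> float:
    """Estimate Z_T from the smoothed estimates (1 + count) / (rounds + 2)."""
    w = {(a, b): 0.0 for a in (+1, -1) for b in (+1, -1)}
    for d, y, k in zip(weights, labels, plus_counts):
        q_plus = (1 + k) / (rounds + 2)
        w[(+1, y)] += d * q_plus
        w[(-1, y)] += d * (1.0 - q_plus)
    return (2.0 * math.sqrt(w[(+1, +1)] * w[(+1, -1)])
            + 2.0 * math.sqrt(w[(-1, +1)] * w[(-1, -1)]))


def sample_until_increase(weights: Sequence[float], labels: Sequence[int],
                          sample: Callable[[int], int],
                          max_rounds: int = 1000) -> Tuple[List[int], List[float]]:
    """Sample each example once per round; stop when the Z_T estimate rises."""
    plus_counts = [0] * len(labels)
    history: List[float] = []
    for rounds in range(1, max_rounds + 1):
        for n in range(len(labels)):
            if sample(n) == +1:
                plus_counts[n] += 1
        history.append(estimate_z(weights, labels, plus_counts, rounds))
        if len(history) > 1 and history[-1] > history[-2]:
            break
    return plus_counts, history


# Hypothetical weak classifier: right with probability 0.7 on every example.
rng = random.Random(0)
labels = [+1, +1, -1, -1]
noisy = lambda n: labels[n] if rng.random() < 0.7 else -labels[n]
counts, history = sample_until_increase([0.25] * 4, labels, noisy)
print(len(history), history[-1])
```

The returned history makes it possible to keep either the last estimate or the one just before the increase; the text does not specify which, so the sketch leaves that choice to the caller.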

Second approach: An alternative method, involving some look-ahead and the user's
stopwatch, may also be considered: having until now trained $T$ classifiers and sampled
$h_T(X_n)$, $1 \le n \le N$, $R$ times, the user has the following options:

A Train a new classifier $h_{T+1}(X)$, using the current estimate of $q(T, \pm, X_n)$ in the
calculation of $D_{T+1}(n)$. Then sample $h_{T+1}(X_n)$, $1 \le n \le N$, once, resulting
in a first MAP estimate of $Z_{T+1}$. As a result, the user decreases the estimated
bound, now $\prod_{t \le T+1} Z_t$, previously $\prod_{t \le T} Z_t$, by the factor $Z_{T+1}$. Also, the user
measures, with his or her stopwatch, the time $\delta_A$ elapsed during training and
sampling. The instantaneous bound decrease rate per unit of time is $(Z_{T+1})^{1/\delta_A}$.

B Sample $h_T(X_n)$, $1 \le n \le N$, once more, producing a new estimate $Z'_T$. As a
result, the user decreases (or increases) the estimated bound by a factor $Z'_T / Z_T$.
Again, with his or her stopwatch, (s)he measures the elapsed time $\delta_B$. The
instantaneous bound decrease rate is $(Z'_T / Z_T)^{1/\delta_B}$.

Finally, based on the smallest bound decrease rate, the user decides whether to keep
the new classifier $h_{T+1}(X)$ or the new estimate $Z'_T$.

We have thus proposed two parameterless ways to estimate $q(T, \pm, X_n)$.
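The comparison made in the second approach amounts to a few lines of arithmetic. The sketch below uses hypothetical names and timings; it assumes the rate of each option is the multiplicative change of the estimated bound raised to the power one over the elapsed time, and that the option with the smaller rate is kept.

```python
def decrease_rate(factor: float, elapsed: float) -> float:
    """Multiplicative change of the bound per unit of time."""
    return factor ** (1.0 / elapsed)


def choose_action(z_new_classifier: float, time_a: float,
                  z_resampled: float, z_current: float, time_b: float) -> str:
    """Return "A" (keep the new classifier) or "B" (keep the new estimate)."""
    rate_a = decrease_rate(z_new_classifier, time_a)
    rate_b = decrease_rate(z_resampled / z_current, time_b)
    return "A" if rate_a <= rate_b else "B"


# Hypothetical measurements: training took 3.0 s, resampling took 0.5 s.
print(choose_action(0.9, 3.0, 0.95, 0.96, 0.5))
```

Ties are resolved in favour of option A here, which is an arbitrary choice of this sketch.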



4 Boosting decision tree 

Having shown how Adaboost can be transposed to probabilistic weak learners, we now 
further extend our study to probabilistic decision trees. 

Computationally, our proposed classifier is a smoothed binary decision tree. In that
model, the output $H(X)$ is a weighted sum of the (random) classifiers on the nodes
traversed by an input element $X$:

$$H(X) = \sum_{t=1}^{T(X)} \alpha_{s(t,X)}\, h_{s(t-1,X)}(X), \tag{4}$$

where $s(t, X)$ is the index of the $t^{\text{th}}$ node reached by input $X$,
$h_{s(t-1,X)}(X) \in \{-1, 1\}$ is the output of the corresponding classifier and $T(X)$ is
the depth of the last inner node reached by $X$ before exiting the decision tree. The
weight $\alpha_{s(t,X)}$ given to $h_{s(t-1,X)}(X)$ is domain-partitioned, since it depends on the
observed value of $h_{s(t-1,X)}(X)$.

Some notation is needed: the index of a node $s$ is a sequence of "+" and "-",
indicating the path to that node. For example, in Fig. 1, right, $s = (+,+,-)$ is the leaf
reached by following the "+" edge out of the root node, then the "+" edge out of the
$(+)$ node, then the "-" edge out of the $(+,+)$ node. By Eq. (4), the output of the
classifier, when an input exits the decision tree by $s$, is thus
$H(X) = \alpha_{+} + \alpha_{++} - \alpha_{++-}$, since the three classifiers traversed output
$+1$, $+1$ and $-1$.

For additional convenience, we write $\bar{s}$ the index of the parent of node $s$
($\bar{s} = (+,+)$ in the previous example) and $\tilde{s}$ the last edge followed to reach $s$
(here, $\tilde{s} = -$). Thus, one may write $s = (\bar{s}, \tilde{s})$. The root node is
$s(0, X) = \emptyset$. With this notation, and noting that $H(X)$ only depends on the leaf
$l(X)$ reached by $X$, one has:

$$H(X) = \sum_{\emptyset < s \le l(X)} \tilde{s}\, \alpha_s = H_{l(X)}, \tag{5}$$

where the sum is taken over all nodes $s$ between the leaf $l(X)$ and the root (exclusive).

4.1 Decision trees with probabilistic nodes

Like most other decision tree-building algorithms [11, 12], we add nodes one at a
time and do not modify previously added nodes. This is the most common way of
avoiding the inherent complexity [13] of building decision trees. We do not consider a
subsequent pruning step. Unlike other decision trees, and like in Adaboost, each node
is trained on the whole dataset.

However, we modulate the weights of the examples, not only based on whether
they are misclassified (as in Adaboost), but also based on their probability of reaching
the node. After having trained $h_s(X)$ with weights $D_s(n)$, $1 \le n \le N$, the weights
for training the children nodes $s+$ and $s-$ are:

$$D_{s+}(n) = \frac{D_s(n)}{Z_{s+}}\, q(s+, X_n)\, e^{-\alpha_{s+} y_n} \quad \text{and} \quad D_{s-}(n) = \frac{D_s(n)}{Z_{s-}}\, q(s-, X_n)\, e^{\alpha_{s-} y_n}. \tag{6}$$

In this expression, $Z_{sa} = \sum_{n=1}^{N} D_s(n)\, q(sa, X_n)\, e^{-a\, \alpha_{sa} y_n}$, for $a \in \{+, -\}$, are
normalizing constants, and $q(s+, X_n) \in [0, 1]$ is the (unknown) parameter of the
Bernoulli random variable $h_s(X_n)$, with $q(s-, X_n) = 1 - q(s+, X_n)$. Like in Sec. 3,
we use estimates $\hat{q}(s+, X_n)$ in place of the true values.

4.2 Bound on the expected error 

We now bound the error of the boosting tree algorithm and specify the weights $\alpha_s$ and
the choice of the trained node at each step.

Using again the exponential error inequality $\mathrm{Loss}(H(X), y) \le e^{-H(X) y}$ of Eq. (2),
the expected misclassification error for a training example $X_n$ is upper-bounded by

$$E\left(e^{-H(X_n) y_n}\right) = \sum_{l\,:\,\text{leaf}} p(l, X_n)\, e^{-H_l y_n}, \tag{7}$$

where $p(l, X)$ is the probability of an input $X$ reaching the leaf $l$. More generally,
assuming independence of the outputs of classifiers at each node, the probability that
$X$ reaches a node $s = (s_1, s_2, \ldots, s_D)$ is

$$p(s, X) = q(s_1, X) \cdot q(s_1 s_2, X) \cdot \ldots \cdot q(s_1 \ldots s_D, X) = \prod_{r \le s} q(r, X),$$

where the product is taken over all nodes $r$ between the root (exclusive) and $s$.
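The path probability can be illustrated with a short sketch. The representation is an assumption made for the example: a node is a tuple of edges in {+1, -1}, the root is the empty tuple, and the hypothetical dictionary `q_plus` maps each inner node to the probability that its classifier outputs +1 on the input considered.

```python
from typing import Dict, Tuple

Path = Tuple[int, ...]


def reach_probability(path: Path, q_plus: Dict[Path, float]) -> float:
    """Probability that the input follows `path` from the root."""
    prob = 1.0
    for depth, edge in enumerate(path):
        q = q_plus[path[:depth]]  # classifier at the parent node
        prob *= q if edge == +1 else 1.0 - q
    return prob


# Hypothetical probabilities for one input X.
q_plus = {(): 0.8, (+1,): 0.5, (+1, +1): 0.25}
print(reach_probability((+1, +1, -1), q_plus))  # 0.8 * 0.5 * 0.75
```

Under the independence assumption, the probabilities of all leaves of a tree sum to one, which gives a simple sanity check for such a routine.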
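The error bound is thus

$$\begin{aligned}
E\left(e^{-H(X_n) y_n}\right)
&= \sum_{l\,:\,\text{leaf}} \Big( \prod_{s \le l} q(s, X_n) \Big)\, e^{-H_l y_n} \\
&= \sum_{l\,:\,\text{leaf}} \prod_{s \le l} q(s, X_n)\, e^{-\tilde{s}\, \alpha_s y_n} \\
&= \sum_{l\,:\,\text{leaf}} \prod_{s \le l} Z_s \frac{D_s(n)}{D_{\bar{s}}(n)} \\
&= \sum_{l\,:\,\text{leaf}} \frac{D_l(n)}{D(n)} \prod_{s \le l} Z_s ,
\end{aligned}$$

where $\bar{s}$ denotes the parent of node $s$, $\tilde{s} \in \{+, -\}$ the last edge followed to
reach $s$, and $D_\emptyset = D$.

Summing over all examples $X_n$, $1 \le n \le N$, and replacing in Eq. (2) yields the bound:

$$E(\mathrm{Loss}) \le \sum_{l\,:\,\text{leaf}} \prod_{s \le l} Z_s. \tag{8}$$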

Like above, each $Z_{sa}$ is minimized by setting

$$\alpha_{sa} = \frac{1}{2} \log\left(\frac{W_s^{a,a}}{W_s^{a,\neg a}}\right), \quad \text{where } W_s^{a,b} = \sum_{n \mid y_n = b} D_s(n)\, q(sa, X_n), \quad a, b \in \{+, -\},$$

and $\neg$ is the negation operator. For these values of $\alpha_{sa}$, each $Z_{sa}$ takes the value
$Z_{sa} = 2\sqrt{W_s^{a,a}\, W_s^{a,\neg a}}$.

This bound can also be found, in slightly different contexts, in our previous work [14]
and in our unpublished manuscript [15]. In the present paper, we additionally study
how this bound evolves with the size of the tree.



Expected error bound as a function of the tree size 

We now describe the evolution of the bound (8) when the tree is grown by a greedy
bound-reducing algorithm.

As previously, we may show that $Z_{s+} + Z_{s-} \le \sqrt{1 - 4\epsilon^2} = \rho$, owing to the
probabilistic weak learner hypothesis.

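We now proceed recursively. After training and incorporating $T$ nodes, the expected
error bound is $C(T) = \sum_{l\,:\,\text{leaf}} \prod_{s \le l} Z_s$. At this point, the tree has $T + 1$
leaves, so that one leaf $l$ at least has an error term $\prod_{s \le l} Z_s$ not less than
$C(T) / (T + 1)$. After training a probabilistic weak classifier at $l$, the new error bound is

$$\begin{aligned}
C(T+1) &= C(T) - \prod_{s \le l} Z_s + \prod_{s \le l+} Z_s + \prod_{s \le l-} Z_s \\
&= C(T) + \Big( \prod_{s \le l} Z_s \Big) \left( -1 + Z_{l+} + Z_{l-} \right) \\
&\le C(T)\, \frac{T + \rho}{T + 1},
\end{aligned}$$

where $\rho = \sqrt{1 - 4\epsilon^2}$ bounds $Z_{l+} + Z_{l-}$.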

Since $C(0) = 1$, we have the general relation

$$C(T) \le \prod_{t=0}^{T-1} \frac{t + \rho}{t + 1} = \frac{1}{T\, B(T, \rho)} \triangleq F(T, \rho) \simeq \frac{T^{\rho - 1}}{\Gamma(\rho)}, \tag{9}$$

where $B(T, \rho)$ is the beta function and $\Gamma(\rho)$ is the Gamma function. The rightmost
term is the asymptotic approximation for large $T$; it is consistent with the bound of
[10, Eq. 6].
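The three expressions of Eq. (9) can be compared numerically. The sketch below is an illustration with hypothetical function names; it evaluates the product directly, the closed form through the log-gamma function (using $1 / (T B(T, \rho)) = \Gamma(T + \rho) / (\Gamma(\rho) \Gamma(T + 1))$), and the asymptotic approximation.

```python
import math


def tree_bound_product(T: int, rho: float) -> float:
    """Product form of the bound, for an integer number of nodes T."""
    bound = 1.0
    for t in range(T):
        bound *= (t + rho) / (t + 1)
    return bound


def tree_bound(T: float, rho: float) -> float:
    """Closed form F(T, rho) = 1 / (T * B(T, rho)), via log-gamma."""
    return math.exp(math.lgamma(T + rho) - math.lgamma(rho) - math.lgamma(T + 1.0))


def tree_bound_asymptotic(T: float, rho: float) -> float:
    """Large-T approximation T^(rho - 1) / Gamma(rho)."""
    return T ** (rho - 1.0) / math.gamma(rho)


for T in (1, 10, 100, 10000):
    print(T, tree_bound_product(T, 0.75), tree_bound(T, 0.75),
          tree_bound_asymptotic(T, 0.75))
```

The log-gamma form is the convenient one in practice, since it also accepts non-integer $T$ and avoids looping over all nodes.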

[Figure 2 plots. Panel titles: "Bound of tree, Adaboost and matryoshka"; "Bound for tree of trees (T=100)"; "Bound for trees of trees (T=10000)".]

Figure 2: Left: Bound of boosted decision tree (full curve, highest), Eq. (9), of
probabilistic Adaboost (dotted, lowest), Eq. (3), and of matryoshka (dashed, middle). From
top to bottom, $\rho \in \{\frac{31}{32}, \frac{7}{8}, \frac{3}{4}, \frac{1}{2}, \frac{1}{4}\}$, i.e. $\epsilon \in \{0.12, 0.24, 0.33, 0.43, 0.46\}$.
Right: Bound of boosted tree of simple trees, given by Eq. (10).

This bound is interesting in more than one respect:



- It appears that it cannot be very much improved, in the following sense: consider
a probabilistic learner with error $1/2 - \epsilon$, independently of the weights $D(n)$
with which it is trained. This learner verifies the probabilistic weak learner
hypothesis. Now, for both the probabilistic Adaboost and for a (balanced) decision
tree, $H(X)$ is a binomial random variable with parameters $(1/2 - \epsilon)$ and the
number of weak classifiers traversed by $X$. This second parameter is $T$ for
Adaboost and $\log_2(T)$ for the decision tree. It is clear, then, that the decision tree
requires exponentially more weak classifiers than the probabilistic Adaboost.

- This bound is especially bad for very weak classifiers ($\rho \simeq 1$). The full curves
in Figure 2, left, plot the bound $F(T, \rho)$ for $\rho = 31/32$, $7/8$, $3/4$, $1/2$ and $1/4$.
For comparison, the expected error bound of Adaboost, $\rho^T$, plotted alongside, is
much lower, especially for $\rho = 31/32$.

- This bound should call the attention of designers of decision trees tempted to pass all
the training dataset along all branches: if the weak classifier is very weak, the
number of weak classifiers needed may grow very large. With stronger classifiers,
the boosting tree algorithm may be more practical.
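To give a feeling for the gap between the two bounds, the following sketch counts how many weak classifiers each bound requires before it falls below a target value. The function names, the target and the search limit are assumptions of the example; the tree bound is the closed form of Eq. (9) and the Adaboost bound is that of Eq. (3).

```python
import math
from typing import Callable, Optional


def adaboost_bound(T: int, rho: float) -> float:
    """Bound of Eq. (3): rho to the power T."""
    return rho ** T


def tree_bound(T: int, rho: float) -> float:
    """Bound of Eq. (9), computed through log-gamma."""
    return math.exp(math.lgamma(T + rho) - math.lgamma(rho) - math.lgamma(T + 1.0))


def classifiers_needed(bound: Callable[[int, float], float], rho: float,
                       target: float, limit: int = 100000) -> Optional[int]:
    """Smallest T with bound(T, rho) <= target, or None if above `limit`."""
    for T in range(1, limit + 1):
        if bound(T, rho) <= target:
            return T
    return None


for rho in (0.5, 0.75, 0.875):
    print(rho,
          classifiers_needed(adaboost_bound, rho, 0.25),
          classifiers_needed(tree_bound, rho, 0.25))
```

The linear search is kept deliberately simple; for values of $\rho$ very close to 1 the tree count exceeds any reasonable limit, which is precisely the point made in the list above.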



5 Matryoshka decision trees 

Based on the conclusion of the previous section, that stronger classifiers yield better
boosted decision trees, we now address the question of obtaining sufficiently strong
classifiers. The first step in this direction (Section 5.1) is to explore the idea of putting
a boosted tree at each node. We will see that there is an advantage in doing so. It will
then be natural, in Section 5.2, to build trees of trees of trees of ... of weak classifiers,
that is, a matryoshka of decision trees.

5.1 Bound for a tree of trees 

In this section, we study the error bounds obtainable by a decision tree built using the
method of Section 4, but where the nodes are themselves trees built according to that
same method. We place ourselves in the situation of having the resources to train a
fixed number $T$ of weak classifiers, and our objective is to minimize the bound on the
expected error.

In this context, it is natural to study the bound obtainable by assembling $T_2$
sub-trees of fixed size $T_1 = T / T_2$. By Eq. (9), the error bound for the sub-trees is
$F(T_1, \rho) = (T_1 B(T_1, \rho))^{-1}$, and that of the outer tree is

$$F(T_2, F(T_1, \rho)) = \left( T_2\, B\left(T_2, F(T_1, \rho)\right) \right)^{-1}. \tag{10}$$

Figure 2, right, plots this bound against $T_1$. The curves show that, for $T_1 = 1$
and $T_1 = T$, the bound is the same as $F(T, \rho)$, i.e. that of a not-nested decision tree.
More interestingly, for intermediate values of $T_1$, the bound of Eq. (10) is always lower
than $F(T, \rho)$. In particular, the minimum is always near $T_1 = \sqrt{T}$.
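The behaviour described here can be explored with a short script. It is a sketch under stated assumptions: the function names are hypothetical, $F$ is evaluated through the log-gamma function, and only sub-tree sizes $T_1$ that divide $T$ are scanned.

```python
import math


def tree_bound(T: float, rho: float) -> float:
    """F(T, rho) = 1 / (T * B(T, rho)), computed through log-gamma."""
    return math.exp(math.lgamma(T + rho) - math.lgamma(rho) - math.lgamma(T + 1.0))


def tree_of_trees_bound(T: int, T1: int, rho: float) -> float:
    """Bound for T / T1 sub-trees of size T1, i.e. F(T2, F(T1, rho))."""
    T2 = T / T1
    return tree_bound(T2, tree_bound(T1, rho))


T, rho = 100, 0.75
divisors = [d for d in range(1, T + 1) if T % d == 0]
for T1 in divisors:
    print(T1, tree_of_trees_bound(T, T1, rho))
best = min(divisors, key=lambda d: tree_of_trees_bound(T, d, rho))
print("plain tree:", tree_bound(T, rho), "smallest nested bound at T1 =", best)
```

The two extreme sub-tree sizes reproduce the plain tree bound because $F(1, x) = x$ for any $x$.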

Given these encouraging results, we are naturally tempted to substitute the sub-trees
(of size $T_1$) by $T_1'$ sub-trees of sub-sub-trees of size $T_1''$, for some $T_1'$, $T_1''$ s.t.
$T_1' T_1'' = T_1$. The same idea can also be applied to the outer tree.

5.2 Bound for a tree of trees ... of trees of weak classifiers

More generally, we are tempted to determine the bounds reachable by trees of trees of
... of trees of weak classifiers. For some $L$ and $T_1, T_2, \ldots, T_L$ s.t. $T_1 T_2 \ldots T_L = T$, the
bound is easily shown to be:

$$F(T_L, F(T_{L-1}, \ldots F(T_1, \rho))).$$

Finding analytically the optimal combination of $T_i$, $1 \le i \le L$, may not be easy.
But, guided by the observation that, for $L = 2$, the optimal choice seems to be near
$T_1 = T_2 = \sqrt{T}$, we naturally consider the case $T_1 = T_2 = \ldots = T_L = T^{1/L}$. In this
case, the bound is

$$F(T^{1/L}, F(T^{1/L}, \ldots F(T^{1/L}, \rho))). \tag{11}$$

The black graph in Figure 3 plots this value against the nesting level $L$, with the
original bound $F(T, \rho)$ (top) for comparison. This figure clearly shows that deeper
nesting levels improve the bound. In fact, Eq. (11) continues to decrease for
$L > \log_2 T$, i.e. when the trees each have less than two nodes.

This (strange) effect is due to the fact that $F(T, \rho)$ is defined for any positive real
$T$. Since the number of nodes is an integer, there are no practical repercussions.

However, these curves clearly indicate that smaller sub-trees yield better bounds.
This suggests building the smallest possible trees, with just two nodes, each node a tree
with two nodes, etc., until the last level, consisting of trees with two weak classifiers.

5.3 Bound for 2-matryoshka

We now derive the expected error bound for the "2-matryoshka" tree, which has exactly
two nodes at all nesting levels. We thus need to assume that $T = 2^L$ is a power of two.

[Figure 3 plots. Panel titles: "Bound on error of nested trees vs. nesting level", for T = 2^10 (left) and T = 2^16 (right).]

Figure 3: Black curve: bound on error of matryoshka decision trees at various levels of
nesting. The sub-tree sizes are $T^{1/(\text{nesting level})}$. Light-colored curves near the black
curve are for trees with an integer number of nodes. The topmost line marks the error
bound of the (not nested) decision tree. At left, $T = 1024$, and $T = 65536$ at right.
These curves are for $\rho = \frac{31}{32}$, i.e. $\epsilon \simeq 0.12$.

We call $M_2(T, \rho) = F(2, F(2, \ldots F(2, \rho)))$ the bound for this tree (there are $L$
nested parentheses). Recalling from Eq. (9) that $F(2, \rho) = \rho \frac{1 + \rho}{2} = \frac{1}{2}\rho + \frac{1}{2}\rho^2$, one
writes $M_2(T, \rho)$ as a polynomial of degree $T$.

Figure 2, left, shows the graph of $M_2(T, \rho)$, in dashed lines. This figure shows that
the 2-matryoshka tree has a much stronger boosting ability than the plain boosting tree,
and this is the main result of this paper.
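The 2-matryoshka bound is easy to evaluate by iterating the two-node bound. The sketch below uses hypothetical function names and compares it, for $T = 2^{10}$ and $\rho = 31/32$, with the plain tree bound $F(T, \rho)$ and with the Adaboost bound $\rho^T$; `nested_equal_bound` evaluates the nested bound with equal sub-tree sizes $T^{1/L}$.

```python
import math


def tree_bound(T: float, rho: float) -> float:
    """F(T, rho) = 1 / (T * B(T, rho)), computed through log-gamma."""
    return math.exp(math.lgamma(T + rho) - math.lgamma(rho) - math.lgamma(T + 1.0))


def matryoshka_bound(levels: int, rho: float) -> float:
    """M_2(2^levels, rho): apply F(2, x) = x (1 + x) / 2 `levels` times."""
    bound = rho
    for _ in range(levels):
        bound = bound * (1.0 + bound) / 2.0
    return bound


def nested_equal_bound(T: float, levels: int, rho: float) -> float:
    """Nested bound with `levels` nesting levels of equal size T^(1/levels)."""
    size = T ** (1.0 / levels)
    bound = rho
    for _ in range(levels):
        bound = tree_bound(size, bound)
    return bound


rho, levels = 31 / 32, 10
T = 2 ** levels
print("plain tree  :", tree_bound(T, rho))
print("2-matryoshka:", matryoshka_bound(levels, rho))
print("Adaboost    :", rho ** T)
for L in (1, 2, 5, 10):
    print("nesting level", L, nested_equal_bound(T, L, rho))
```

For these values the matryoshka bound lies between the other two, in agreement with the ordering of the curves described for Figure 2, left.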



5.4 Building a matryoshka 

The algorithm for the 2-matryoshka would thus be: train a two-leaf tree (stage "b", in
Fig. 1), and collect the leaves into a single node ("c"). Train a two-leaf sub-tree on one
of the branches, collect its leaves in a single node ("e"). Collect the leaves once more
("f"), etc. If all weak classifiers have the same edge $\epsilon$, then this approach is the most
appropriate.

In practice, the classifiers will not have the same edge and a greedy bound-decreasing
approach (greedy with respect to the number of nodes or to physical training time) could
be considered. Each time a classifier is added to the tree, we will consider each
sub-tree containing that node, starting from the top. For each sub-tree, we compare the
instantaneous bound decrease rate (see footnote 4) of the sub-tree at $T$,

$$C'_{\text{Simple}} \simeq \left( C(T+1) - C(T-1) \right) / 2, \tag{12}$$

($C(T)$ being computed on the sub-tree only), with that of a tree having such a sub-tree
at each node,

$$C'_{\text{Matryoshka}} = \frac{\partial}{\partial t} F\left( \frac{t}{T}, C(T) \right) (t = T). \tag{13}$$

If the latter is smaller, then the leaves of the sub-tree are collected into a single node.

We now give the detail of computing Eq. (13). Using the relation
$\frac{\partial}{\partial x} B(x, y) = B(x, y) (\psi(x) - \psi(x + y))$, where $\psi$ is the digamma function,
$\psi(x) = \frac{d}{dx} (\log(\Gamma(x)))$, one gets

$$F'_T(T, \rho) = -F(T, \rho) \left( \frac{1}{T} + \psi(T) - \psi(T + \rho) \right) \quad \text{and}$$
$$F'_\rho(T, \rho) = -F(T, \rho) \left( \psi(\rho) - \psi(T + \rho) \right).$$

The first line above then gives

$$\frac{\partial}{\partial t} F\left( \frac{t}{T}, C(T) \right) (t = T) = \frac{C(T)}{T} \left( \gamma + \psi(C(T)) + \frac{1}{C(T)} - 1 \right), \tag{14}$$

where $\gamma = -\psi(1) \simeq 0.5772$ is Euler's constant.

One can check that, for $T = 1$, $C'_{\text{Simple}} = C'_{\text{Matryoshka}}$ and that, if $C(T) =
F(T, \rho)$, i.e. if the bound (9) is tight, then $C'_{\text{Simple}} > C'_{\text{Matryoshka}}$ for all $T > 1$.

4 Here, we consider the decrease rate per added node, but the decrease rate per unit of training time could
be used too.
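The two decrease rates can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the digamma function is approximated by a central difference of `math.lgamma` (the Python standard library has no digamma), and the rate of the plain tree is taken as the exact derivative $F'_T(T, \rho)$, which assumes that the bound (9) is tight.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma(x: float, h: float = 1e-5) -> float:
    """Numerical digamma: central difference of log-gamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def tree_bound(T: float, rho: float) -> float:
    """F(T, rho) = 1 / (T * B(T, rho)), computed through log-gamma."""
    return math.exp(math.lgamma(T + rho) - math.lgamma(rho) - math.lgamma(T + 1.0))


def simple_rate(T: float, rho: float) -> float:
    """F'_T(T, rho): decrease rate of a plain tree whose bound is tight."""
    return -tree_bound(T, rho) * (1.0 / T + digamma(T) - digamma(T + rho))


def matryoshka_rate(T: float, c: float) -> float:
    """Right-hand side of Eq. (14), with c = C(T)."""
    return (c / T) * (EULER_GAMMA + digamma(c) + 1.0 / c - 1.0)


rho = 0.9
for T in (1, 2, 4, 8):
    c = tree_bound(T, rho)
    print(T, simple_rate(T, rho), matryoshka_rate(T, c))
```

Both rates are negative; the more negative one corresponds to the faster decrease of the bound, so a smaller matryoshka rate favours collecting the leaves of the sub-tree into a single node.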

6 Discussion and conclusions 

We have developed in this paper a theory of probabilistic boosting, aimed at decision 
trees. We proposed a boosting tree algorithm and a theoretically superior matryoshka 
decision tree algorithm. These algorithms are essentially parameter-free, owing to the 
principle of choosing whichever training action most reduces the expected training 
error bound, and to a judicious choice of possible training actions. 

We showed bounds on the expected training error of the algorithms, one of them
discouraging, the other encouraging. The bounds for simple trees and for trees of trees
are consistent with our early experiments.

Future developments include an analysis of the effect of approximating the node
branching probabilities $q(s, X_n)$ during training and experimental evaluation of the
matryoshka.

On a more general level, we believe that the high bound for boosting trees indicates
that the probabilistic weak learner hypothesis is inadequate. This hypothesis, directly
adapted from the theory of boosting, does not take into account the fact that real-world
classifiers usually have a lower training error on smaller training sets. Our intuition is
thus that the entropy of the training weights, $D(n)$, should be taken into account in
future work.